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Abstract. Phylogenetic comparative analysis is an approach to inferring evolutionary process from a 
combination of phylogenetic and phenotypic data. The last few years have seen increasingly sophisti¬ 
cated models employed in the evaluation of more and more detailed evolutionary hypotheses, includ¬ 
ing adaptive hypotheses with multiple selective optima and hypotheses with rate variation within and 
across lineages. The statistical performance of these sophisticated models has received relatively little 
systematic attention, however. We conducted an extensive simulation study to quantify the statistical 
properties of a class of models toward the simpler end of the spectrum that model phenotypic evolution 
using Ornstein-Uhlenbeck processes. We focused on identifying where, how, and why these methods 
break down so that users can apply them with greater understanding of their strengths and weaknesses. 
Our analysis identifies three key determinants of performance: a discriminability ratio, a signal-to-noise 
ratio, and the number of taxa sampled. Interestingly, we find that model-selection power can be high 
even in regions that were previously thought to be difficult, such as when tree size is small. On the 
other hand, we find that model parameters arc in many circumstances difficult to estimate accurately, 
indicating a relative paucity of information in the data relative to these parameters. Nevertheless, we 
note that accurate model selection is often possible when parameters arc only weakly identified. Our 
results have implications for more sophisticated methods inasmuch as the latter arc generalizations of 
the case we study. 

Keywords: phylogenetic comparative analysis, adaptation, model selection, evolutionary model, Ornstein- 
Uhlenbeck 
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Centra! to modem comparative methods are the mathematical models that connect phenotypic evo¬ 
lution to the branching pattern of the phylogeny. In recent years, creative application of these models 


has allowed biologists to test complex hypotheses involving adaptive radiation (Butler and King 

2004 

Monteiro and Nogueira 2011, Setiadi et all 2011), multiple rates of evolution (Beaulieu et al. 

2012 

O’Meara et al. 2006 Venditti et al. 2011), and specialization to environmental 

Collar et al. 

2011 

2010 Edwards and Smith 

2010 Gomez and Thery , 2007; Kozak and Wiens 2010 

Labra et al. 

2009) 

or behavioral factors (Col 

ar et al. 2009 Monteiro and Nogueira 2011 Scales et al. 2009). Genuine 


progress is made when meaningful alternative hypotheses arc compared and some are rejected. Ex¬ 
amples include tests of whether muscle properties of animals evolve in response to the need to flee 


predators vs. the need to chase down prey (Scales et al.| 2009) and whether floral evolution is driven by 
pollinator shifts vs. coevolution between flowers and pollinators (i.e., gradual vs. punctuated evolution; 
Whittall and Hodges} 2007 1 . 


The ability to formulate and test these hypotheses has been made possible by an expansion of mod¬ 
eling strategies over the last 30 years. Model-based approaches to phylogenetic comparative methods 
began with application of the Brownian-motion (BM) model, which captures both the stochastic nature 


of evolutionary change in phenotypes and the expectation of similarity due to common ancestry (Felsen- 
stcin 1973} 19851. More recent extensions of BM allow rates of stochastic evolution to vary across the 
phylogenetic tree ( Eastman et al.} 2011 O’Meara et al.[ 2006} Revell et al. 2011 Venditti et al. 20111. 
Lande (1976, 1980) expanded the model repertoire, introducing the Ornstein-Uhlenbeck model as a 
formal means of modeling both natural selection and genetic drift in macroevolution. Inspired by Simp¬ 


son 


s (1953) notion that major evolutionary diversification events occur when species enter new “adap¬ 
tive zones”, Lande conceptualized phenotypic evolution on a fitness landscape, on which adaptation 
pulls traits toward one of several optima and stochastic forces introduce variation in trait values but also, 
importantly, allow switching between adaptive zones. The OU model’s accounting for selection, noise, 
and multiple trait optima allow it to be used to represent alternative evolutionary histories and thus make 
it a valuable tool in phylogenetic comparative analysis. Hansen (1997]) pioneered the use of the OU pro¬ 
cess with multiple optima in comparative analysis. Later, Butler and King!(2004) put Hansen’s method 


into an information-theoretic framework and illustrated how multiple competing evolutionary hypothe¬ 
ses could be tested via the comparison of alternative evolutionary-history scenarios. More recently, the 
SLOUCH model expands the range of testable hypotheses still further, allowing hybrid OU-BM models 


and a form of evolutionary regression between phenotype and covariates (Hansen et al. 2008 Labra 


et al.[ [2009]). Both the OUCH and SLOUCH models have multivariate extensions (Bartoszek et al. 


2012 King and Butler 2009). Another recent expansion implements multiple-optimum OU models in 


which noise intensity and selection strength can vary across lineages (Beaulieu et al.| 2012|). 


It is important to note that the models just described can be used to achieve various goals. When the 
goal is the evaluation of a relatively small set of distinct, biologically informed hypotheses, information- 


based model selection approaches are useful as a means of evaluating relative explanatory power (Bar- 

toszeket al.| 2012; Beaulieu et al.} 2012 Burnham and Anderson} 2002;}Butler and King, 2004, 

O’Meara 

et al. 

2006). Typically, the selection of models is not an end in itself but is rather an aid to reasoning 


about evolutionary mechanisms and drivers. In such a context, one bears in mind that the larger the 
number of models entertained, the less reliable are the inferences obtained. This is a problem that the 


use of information criteria only incompletely ameliorates 
For this reason, inferences are often clearest when the hypotheses are laid out a priori. 


(Boettiger et al. 2012 Ho and Ane| |2014 


By contrast, when the goal is to reconstruct ancestral states, one wishes to avoid a priori limitation 
on the range of possible histories. In particular, one may seek to infer, from knowledge of the pedigrees 
and traits of extant organisms, ancestral character states and/or historical shifts among selective regimes. 
In such a context, one typically must winnow through a large number of alternative histories to identify 
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those most consistent with the data (Eastman et al. |2011| Hipp and Escudero||2010[|Ingrarn~and Mahler 


2013[ Revell et~ah||2011{ Uyeda and Harmon) 2014 1 . Such an exercise is perhaps best thought of as 


form of exploratory data analysis. 


Regardless of the ultimate goal, however, greater flexibility and realism in models comes at the 
cost of more mathematical complexity, more parameters to estimate, and more models to compare. In 
effect, as parameter space dimension increases, the information contained in data becomes more dilute, 
the precision with which parameters can be estimated diminishes, and the risk of overfitting grows 
(Ho and Ane 2014). We set out to identify the principal determinants of model-selection power and 
parameter estimation accuracy in a class of models of intermediate complexity by means of an extensive 
simulation study using OU models with multiple adaptive optima (the OUCH class of models). The 
R package ouch implements these methods, but other software implementations exist, for example 
in the packages ape (Paradis et al. 2004i, geiger ([Harmon et al.| 2008a), and OUwie (Beaulieu 


et al.[|20l2i. Our results directly concern the reliability of model selection, the accuracy of parameter 


estimation, and the design of comparative studies based on such models. In addition, our results have 
implications for inferences based on more complex models, inasmuch as the latter arc extensions of the 
OUCH class. The results of this study apply directly to the specific models implemented in the ouch 
package and with equal force to other software implementations of the same models and, moreover, to 
more sophisticated models, of which the OUCH models are special cases (e.g., Bartoszek et al. 2012[ 
Beaulieu et al.[ 2012[ Hansen et al. 2008). We detail these implications in the Discussion. 


In the following, we first describe and explain the models explored and point out the key quantities 
informed by data. Next, we outline the design of our simulation study. We then present results regarding 
the determinants of model selection power (probability of choosing the correct model) and parameter 
estimation accuracy (expected error in estimated parameters). We identify three dimensionless numbers 
that effectively determine statistical power and discuss how other features of the data affect inference 
reliability. Some of the latter, such as the size and shape of the phylogenetic tree and the distribution of 
selective regimes on the tree, may be affected by study design via choice of taxa sampled; we quantify 
the relative importance of each of these. Importantly, we find that large errors in parameter estimates are 
quite common even in circumstances under which power is high. Finally, we discuss the implications 
of our results for study design, for the interpretation of analytical results, and for more sophisticated 
OU-based methods. 


Methods 


Models of evolution for quantitative characters. Both Brownian motion (BM) and Ornstein-Uhlenbeck 
(OU) models were used in this study. We provide only a brief account of these models here; both have 
been described in detail previously (Bu tler and King[ 2004; |Hansen[ 1997). The BM model for phe¬ 
notypic evolution assumes that evolutionary changes along a lineage are stochastic and uncorrelated in 
time. In particular, over any time interval, the phenotype changes by a random amount, the magnitude 
of which has mean zero and variance proportional to the duration of the interval and is independent of 
changes over any other interval. Under BM, therefore, observed correlations among the traits of extant 
taxa are attributed to a combination of chance and shared evolutionary history. Mathematically, BM is 
the Ito diffusion 

dX(t) = adB(t), X{0) = X 0 , 


where the phenotypic trait X{t) evolves through time t along each lineage, random deviations are intro¬ 
duced by the Gaussian white noise d B(t )—usefully visualized as a normal random variable with mean 
zero and variance dt—and the magnitude of these deviations depends on the noise intensity, a. Previous 
authors, including ourselves, have used the term “drift” in connection with this noise. It is important to 
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A dX(t) — a (9 — X{t)) dt + o dB{t) 
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FIGURE 1. (A) ouch models the evolution of quantitative traits as an evolutionary 
change in the mean phenotypic value, which depends on the selective regimes operating 
during evolutionary history. Given phenotypic data and a phylogenetic tree, investiga¬ 
tors can specify evolutionary hypotheses by “painting” the branches of the phylogenetic 
tree with the selective regime assumed to be operational along that branch, ouch fits a 
model with common values of a and a, but with different optima on branches as speci¬ 
fied by the investigator. (B-D) Hypothetical alternative hypotheses for the evolution of 
body size in a clade of lizards. (B) A hypothesis of niche partitioning by food type for 
the evolution of body size (numerical values). Animals specializing on an insect diet 
enter a new selective regime indicated by purple. (C) A hypothesis of specialization to 
habitat type, with shifts occurring when tree-dwelling lineages enter a grassy habitat. 
(D) A hypothesis of sexual selection driving evolutionary change in body size, with 
sexual selection indicated by color dimorphism. Each of these evolutionary scenarios 
can be expressed as a separate ouch model. Model fitting can then determine which 
hypothesis is best supported by data. 


note that stochasticity may arise not only from genetic drift but from other, unmodeled factors. Here 
and in the following, we adopt the convention that t = 0 at the root of the phylogeny. When BM is fit 
to data in a comparative analysis, the only parameters to be estimated are a and the phenotype at the 
phylogeny’s root, Xq. 

The OU model expands on BM by including a deterministic component that models the tendency 
of a trait to evolve towards an adaptive optimum. In this model, this tendency is scaled by the parameter 
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a, which quantifies the rate at which the trait is pulled towards an optimum, while the optimum trait 
values, which may change with time and across branches of the phylogeny, are denoted 6{t). From the 
point of view of Simpson (19531 and Lande ( 1976[ |T980 l, a is interpreted as “strength of selection”. 
Formally, the OU process is an Ito diffusion satisfying 


d X(t) = a ( 6(t ) — X(t )) dt + adB(t). 


( 1 ) 


Additional assumptions. To define an OU process, one must make some assumption about the root- 
state, A(0). One may follow Hansen (1997) and Butler and King (2004) in introducing a parameter, 
X() = X(0), as in the BM model. This assumption implies that BM is the special case of the OU model 
for which a = 0. It frequently happens, however, that the data carry little information on Xq. In fact, 
Ane ( 2008) ) proved that under Brownian motion evolution, the estimate of the root-state is inconsistent, 


tending to a random value as sample size increases. Moreover, for OU models, one sometimes finds that 


estimates of Xq and a are correlated. The alternative we adopt is to follow Hansen et al. (2008) and Ho 


and Ane (2013) in assuming that X(0) is distributed according to the stationary distribution of an OU 


process: 


X(0) 


normal 


2 a 


( 2 ) 


2 

[The stationary trait distribution about a particular optimum is normal with mean 6 and variance f-.] It 
is important to note that, under this definition, BM is no longer nested within OU. Nor can we redefine 
BM to make it so, since BM lacks a stationary distribution. 


In the foregoing, the time-dependence of the optimum, 9, has been left general. The OUCH class 
of models restricts this dependence to a particular form: 6{t) is piecewise constant on intervals and 
takes values in a finite set {#«}, the set of “selective regimes” or just “regimes” (a term we prefer to 
Simpson’s “adaptive zones”). Moreover, it is assumed that the intervals and regimes are known, though 
the optimum trait values associated with each regime are not. As an aid to visualization, one can 
associate a color with each selective regime. Applying the corresponding color to each interval on the 
phylogeny, one obtains a “painting” of the phylogeny (Fig. [T]). The OUCH class further assumes that 
a and er are constants. A frequent misconception about this class of models is that regime shifts must 
co-occur with phylogenetic branch points within this class of models. This is incorrect: one can work 
around a software implementation that allows regime shifts only at nodes by inserting additional nodes 
as needed. 


Dimensionless parameter combinations. The fact that the variables and parameters of the OU model 
have units presents a challenge to the design of a simulation study. In Eq. [T] in particular, X and 9 
share the same units, which depend on the trait in question, the tree depth T has the same units as t, 
a has units of inverse time, d B has units of time to the | power, and a, therefore, has units of trait 
times time to the — ^ power. The statistical performance of any model, however, cannot depend on the 
arbitrary choice of units. Thus, while a change in the scales used to measure t or X will change the 
numerical values of a, a, and 0j, statistical inference must depend only on dimensionless combinations 
of these parameters ( Buckingham[ 1914 1 . We can identify these dimensionless combinations via the 
simple mathematical method of non-dimensionalization, which involves standardizing the problem by 
rescaling model variables. Non-dimensionalization reduces the number of parameters to the minimal set 
needed to produce the full range of model behaviors. While there is typically some freedom in the choice 
of scaling, judicious choice results in dimensionless parameters that have meaningful interpretation (for 
example, for swimming animals, non-dimensionalizing the Navier-Stokes equations of fluid dynamics 
by animal length and speed leads to the Reynolds number, which determines whether flow is laminar or 
turbulent). [Strang ( jl986 1 and Murray (2002 > give good expositions of non-dimensionalization. Here, 
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we derive three dimensionless quantities that, as we will show, largely determine the statistical power 
of OU-based model selection and the accuracy of parameter estimates. 

We proceed by identifying natural scales on which to measure the variables. An obvious choice 
of time-scale is the total depth, T, of the phylogeny. Accordingly, we define the dimensionless time 
t = t/T. Several choices of scaling for X are available. In earlier studies, we and others have 
used an arbitrary scaling for X. Alternatively, letting Ad denote the smallest difference between 
two selective regimes, one can define the dimensionless trait X = X/A9 and dimensionless optimum 
6{t) = 9(t/T)/A9. Then the OU process for lineage i obeys 



r 



= (aT) 

6i(t)-Xi 

dt+ AO ) dB ' = " 

diW-Xi 


where d B is an increment of a dimensionless Brownian motion process. The dimensionless parameter 
77 = aT can be viewed as a measure of the depth of the phylogenetic tree relative to the characteristic 
timescale associated with adaptation. Herein, we refer to 7/ as the selection opportunity , since it involves 
both the strength, a, of selection and the amount of time it has had to work. It is p roport ional to the 


depth of the phylogeny measured relative to the phylogenetic half-life (Hansen 


1997). Note that 


r) is not to be confused with the population-genetic concept of opportunity for selection (Arnold and 
Wade 1984). The quantity 7 = is a dimensionless measure of the noise intensity. 


Two particular combinations of p and 7 will turn out to be important from the point of view of statis¬ 
tical analysis. The first, 0 = A6> —which we will call the discriminability ratio —measures 

the smallest difference between selective optima relative to the standard deviation of the stationary trait 
distribution. When 0 is large, the optima are distinct relative to chance variability around them; when 0 
is small, chance variability will obscure the difference between the closest optima, even when selection 
has been acting over very long times. The second quantity, ^Jp 0 , combines the selection opportunity 

and the discriminability ratio. Rewriting, we see that yjp 0 = A6> is proportional to the ratio of 

a, which quantifies the strength of the deterministic signal in Eq. [I] to the noise intensity, a. For this 
reason, we refer to ^Jp 0 as the signal-to-noise ratio (SNR). It will emerge that the SNR is a predictor 
of statistical power. 


It is worth noting at this point that, while the particular non-dimensionalization we have chosen 
leads to insight into the statistical properties of OU-based comparative methods, it does not immediately 
translate into a prescription for data analysis, i.e., it is not straightforward to fit this dimensionless model 
to data. We return to this issue in the Discussion. 


Plan of the simulation study. The simulation study was designed to investigate the influence of aspects 
of the data-generating process on model-selection power and parameter-estimation accuracy. We varied 
the number of taxa sampled, the values of a and <7, and the arrangement (“painting”) of selective regimes 
over the phylogeny. Since tree height, T and minimum inter-optimum distance A 6 set the scales of time 
and phenotype, respectively, we fixed T = 1 and A 9 = 1 without any loss of generality. We chose 
( 77 , 0, tree size) triples over a broad range of parameters using a Latin hypersquare design (see below 
and Fig. |A2[ ). We simulated phylogenetic trees, regime paintings, and phenotypic data under each 
parameter set and fit each of six competing models to the data as described below. We assessed model- 
selection power as the probability that the true model had the lowest AIC c score. In addition, when the 
true model was selected, we assessed parameter-estimation accuracy as mean squared error (MSE) in 
each of the estimated parameters. We investigated how power and accuracy depend upon the selection 
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opportunity, discriminability ratio, tree size, tree balance, and evenness of regime representation across 
both the tips and the branches of the phytogeny. 


Fig. |A 1 1 gives a schematic overview of the study design. In all, the simulation study used 220 
parameter triples, 40 sets of regime paintings per parameter triple, 40 trees of each size per regime 
painting, and 40 sets of simulated phenotypic data per tree. In all, then, we fit each of the six models 
to 14.08 X 10 6 simulated datasets, using about 7 x 10 3 cpu-hr of computation. Further details of 
the simulation study are given below. Computations were carried out in the R statistical computing 
environment ( R Core Team} 20131. We next describe the simulation study in more detail. All codes 
necessary to fully reproduce this study in all particulars are provided in the Supplement. 


Generating random phytogenies. Because the raw data for phylogenetic comparative analyses are typ¬ 
ically hard-won and statistical performance is most an issue when sample sizes are small, we focused 
our investigation on phylogenies of between 10 and 50 extant species, a modest number representative 
of many studies in the literature. We encountered no reason to doubt that larger sample sizes result in 
better inferences generally. Ideally, we would simulate phylogenies from a model known to generate 
realistic trees. However, some have questioned the ability of generic tree models, such as pure birth or 
birth-death processes, to generate trees with shape properties similar to empirical phylogenies (Mooers 
and Heard| 19971. More generally, we know of no generative model for realistic phylogenies. We there¬ 
fore randomly subsampled a large empirical ultrametric phylogeny of 187 species of Caribbean Anolis 
lizards (Nicholson et al.[, (2005 l. Although phylogenies so generated may or may not be realistic, they 
are indubitably real. Note too that this method of generating phylogenies bears some resemblance to 
that in which real phylogenies are obtained, viz., by subsampling the Tree of Life. Be this as it may, the 
phylogenies so generated display a wide range of topologies and shapes. Fig. |A3| shows the distribution 
of several metrics of tree shape (Mooers and Heard| 1997 1 for the 65600 distinct trees used in this study. 


Selecting model parameters. In order to develop an understanding of where, how, and why statistical 
performance of the OUCH-class methods varies, we focused on the region of parameter space where 
statistical power and parameter-estimation accuracy range from very high to quite low. In particular, we 
varied the dimensionless parameters rj and <i> over the range (0.2, 5), which represents a broad spectrum 
of scenarios, from ones in which the data contain very little information about the evolutionary process 
to ones where recovery of the true model is nearly guaranteed. As rj varies from 0.2 to 5, the depth of 
the tree as measured in phylogenetic half-lives grows from 0.29 to 7.2. As 6 varies over the same range, 
the separation of adjacent optima grows from 1/5 to 5 standard deviations of the stationary distribution. 
As discussed above, we varied tree size (number of sampled taxa) from 10 to 50. Specifically, we 


used a Sobol’ low-discrepancy sequence (Press et al. 1992) to generate 220 ( 77 , 0. tree size) triples. We 


computed a and a according to the relations described above. Fig. A2 shows the range of parameters 
used in the study. 


Randomizing over disposition of selective regimes. 40 sets of 3-regime paintings were simulated on 
the large Anolis phylogeny by assuming that regimes shifts occur according to a Poisson process along 
each lineage and that shifts occur randomly among the regimes. Specifically, the expected waiting 
time between shifts was 0.6 T and the identity of each new regime was selected randomly with equal 
probability from among the two candidates. The 40 regime paintings are shown in the Supplement. 
Each sampled subtree inherited its painting from the large phylogeny. 


Computing tree and regime metrics. Tree shape and the distribution of regimes are to some extent a 
function of experimental design, as they are determined by the choice of taxa included in a study. We 
therefore aimed to quantify their effects on statistical performance. For each subtree, we calculated 
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several measures of tree shape and regime distribution. To quantify tree shape, we required a measure 
able to accommodate polytomies. We chose a measure based on Sackinf s (1972]) index, as follows. 
For each tip i = 1,..., n, let N r be the number of nodes between the tip and the root (including the 
root). The average, N = F over all tips yields a measure of imbalance. We normalized this 


index by dividing by its expectation, E [AT] = 2 ]U” =2 *, under a coalescent model (Kirkpatrick and 
Slatkin] 19931, which yields a quantity roughly independent of tree size. We refer to this metric as tree 


imbalance since larger values indicate a more unbalanced tree. 


We also assessed how the distribution of selective regimes among tips and along lineages affects 
statistical performance. To calculate tip evenness, Jr, we used the standardized Shannon diversity index 


Jt 


~ J2i=lPi lQ g Pi 

log 5 


where S = 3 is the number of regimes and pi is the fraction of tips in each regime. Evenness of regime 
representation along the branches, Jb, was computed by applying the same Shannon diversity index 
formula to the proportion of total branch-length in each regime. Both Jt and Jb vary between 0 and 1; 
higher values correspond to more even distribution of regimes. As Fig. |A3| shows, no strong correlations 
between tree size and these metrics were in evidence among the trees we used. 


Simulating the phenotypic data. For each parameter combination and regime painting of the large tree, 
we generated 40 phenotypic datasets using the stochastic 3-regime OUCH model. We assumed equal 
distances between adjacent optima. With this assumption, location of the optima is inconsequential once 
the scale of X has been chosen; we therefore set Qa = — 1 without additional loss of generality, and 
Ob = 0 ,0c = F 


Model fitting. To evaluate model-selection power, we fit a panel of models to the simulated data, as¬ 


suming known phylogeny and regime painting, using the ouch R package (King and Butler 2009). In 
particular, we considered Brownian motion (BM), a single-optimum OU model (OU1), two 2-regime 
OU models (OU2ab, OU2bc), the true model (OU3), and a fully non-phylogenetic model (NP). The two 
2-regime OU models are nested within the OU3. In the OU2ab, the two large-value regimes (B & C) 
are collapsed into a single regime; in the OU2bc, regimes A & B are collapsed into one. The NP model 


is equivalent to the “white noise” model implemented in geiger (Harmon et al. 2008b Hunt 2006). 
Under it, each species’ phenotype is modeled as an independent sample from a normal distribution with 
mean depending on the tip regime for that species and common variance. 

Maximum-likelihood model fitting was performed using a combination of Nelder-Mead and sub- 
plex optimization algorithms ( |King[ [2008j Rowan, fl990| ). We performed model selection using the 
Akaike Information Criterion corrected for small sample size (AIC c ; Burnham and Anderson |2002 1 . 
The best model was deemed simply to be that with the lowest AIC c ; this ignores the potential for am¬ 
biguous or inconclusive model-selection results which arise in practice when one or more models have 


AIC c scores differing by a few units only. Boettiger et al. (2012) offer a convincing demonstration that 
over-reliance on information criteria can lead to unreliable conclusions. They very sensibly advocate a 
case-by-case approach in which the parametric bootstrap is used to compute more accurate P-values for 
information-criterion-based model selection. Obviously, such a case-by-case, bootstrap-based approach 
would be too computationally expensive for our purposes. However, we employ this bootstrap analysis 
below in the examination of a few illustrative cases. Furthermore, we anticipate that the breakdown 
of our raw-AIC c approach will be most common at the frontiers dividing high-power from low-power 
parameter regions and that our principal conclusions will be robust to the choice of model-selection 
criterion. 
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We note that our assumptions of known phylogeny and regime painting ignore potentially large 
sources of uncertainty and therefore make our estimates of statistical performance optimistic: power 
and accuracy are almost certainly overestimated. However, as assessment of power was a primary goal 
of the study, it was essential that the true regime painting be included among the competing hypotheses. 
Moreover, since we initiated the iterative optimization algorithms at the true parameter values, our re¬ 
sults likely understate the effects of estimation error, a sometimes nontrivial issue in actual data analysis. 
Again the effect is to make our assessment of statistical performance optimistic. 


Assessing power and accuracy. Power was calculated at each parameter-value/phylogeny/painting com¬ 
bination as 


power = 


R+\ 


(3) 


N + V 

where N is the number (40) of replicate phenotypic datasets and R is the number for which the true 
evolutionary model was better supported than the alternatives. Eq. [3] is slightly more stable than R/N 
and disallows the possibility of power = 0 or 1 . 


We assessed the accuracy of parameter estimates when the true model was selected using mean 
squared error (MSE): 

MSE(/3) = i^(A-/?) 2 , 

i 

where (3 stands for the true value of any one of the model parameters, dj, is the estimate in the ith 
replicate, and the sum is taken over the R replicates for which the true model was best supported. In 
restricting attention to parameter estimates from the best-supported model only, we reasoned that an 
investigator using this method would be uninterested in parameter estimates from unsupported models. 


We explored the dependence of power and MSE on 77 , 0, tree size, tree imbalance, and the two 
regime-evenness metrics, tip evenness, Jt, and branch evenness, Jb- We quantitatively assessed these 
relationships using generalized additive model (GAM) regression of power on these predictor variables 

are given in Appendix |Bj 


(Wood 2006, via the mgev R package). Summaries of the best-fitting GAMs and several alternatives 


Results 

Model selection power. Overall, the simulation study encompassed regions of high, low, and moderate 
power. In some regions of parameter space, power approached 100 percent, especially when selection 
opportunity rj and discriminability ratio cj) were high, and 20 or more taxa were included (Fig. [2]). Other 
regions of parameter space had very low power, such that the true model was rarely selected. The GAM 
regressions indicated that the strongest predictors of power were rj, <f>, tree size, and the evenness of 
regimes across the branches. Unexpectedly, the effect sizes for tip evenness and tree imbalance were 
both very small and statistically significant only because of the large sample size. In other words, 
evenness in the proportion of evolutionary time spent in the various regimes was more predictive of 
power than were balance in the shape of the tree or evenness in the the number of tip taxa in each 
regime. 

Unsurprisingly, the support for each of the alternative (incorrect) models varied strongly with se¬ 
lection opportunity rj and discriminability ratio cj) (Fig. [2]). Support for Brownian motion models was 
high only when selection opportunity 77 was very small. The OU1 model had highest support when 
selection opportunity was large but the discriminability ratio <f> was very small, i.e., when selection was 
strong but the noise was nevertheless strong enough to obscure differences among the optima. The two 
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FIGURE 2. The frequency (fraction of 40 phenotypic datasets) that each of the alter¬ 
native models was best-supported as a function of the dimensionless simulation study 
variables: selection opportunity 77 , discriminability ratio 0 , and tree size. 


alternative OU2 models found high support at moderate values of 77 and <p >, whereas the true OU3 model 
had high support for larger values of 77 and q i. 


The regions where alternative models have high support (Fig. [2]) have a particular shape, suggesting 
that some function of 77 and cp will be informative for power. Figs. [3] and A5 show that, to a good 
approximation, the dimensionless signal-to-noise ratio (SNR), ^/tj 0 , together with tree size, is such 
a determinant, especially when cp is large. Fig. [3] shows how power varies with 77 and cp for different 
combinations of tree size and branch evenness. The correspondence between the contours of power 
and those of SNR in Fig. [^indicates that SNR is a useful predictor of power. Fig. |A5| shows power as a 
function of SNR and tree size for a range of branch ( Jb) and tip (.Jr) evenness scores. Tree size plays an 
important role in determining power: for very small trees, power is low except at high values of SNR. 
This figure also emphasizes the relatively small contribution regime evenness makes to determining 
power. SNR is not a perfect determinant: using 77 and 0 together independently as predictors gives a 
better GAM fit to the data than using SNR alone, but the improvement in R 2 is slight (0.939 vs. 0.937, 
Appendix |B]). 


Parameter estimation accuracy. The GAM regressions indicated that, as was the case for model- 
selection power, 77 , cp, free size, and branch evenness were the largest determinants of parameter estima¬ 
tion accuracy. Figs. [4]-[6] depict parameter estimation error based on the best-fitting GAM regressions. 
Accuracy of the estimates 6 C , (p, and 77 are presented in terms of relative root mean square error (RMSE). 
Errors in estimates of the other two optima (Figs. |A 6 | and A7 1 show a pattern similar to that in Fig. [4] 







11 



FIGURE 3. Power as rj and <i> are varied for a range of branch evenness (J/s) and tree 
size values. Red corresponds to high (>50%) probability of selecting the true model 
(blue <50%), smoothed using GAM regression, assuming perfectly even tips and a 
highly balanced tree (imbalance metric of -0.5). Fig. |A4| in Appendix [A] shows that 
power is little affected by the tip evenness and tree imbalance. Contours show the 
levels of SNR = y 7 rj 4>. 


Accuracy of estimates of the optima display a dependence on the selective regime at the root which we 
describe in the Discussion. 

Several important patterns are evident in Figs. [4]-[6] Of these parameters, it is clear that the se¬ 
lective optima are the best estimated. Relative errors of 100% or more are only present for very small 
values of </> and are rare when the number of tips is large. The discriminability ratio <i> is also relatively 
well estimated, provided the tree is of sufficient size and </> is not very small. Selection opportunity is 
substantially more difficult to estimate accurately: when either rj or cj) is small, relative errors exceeding 
100 % arc common, even when the correct model has been selected. 
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No. tips = 10 No. tips = 30 No. tips = 50 



01 234501 234501 2345 

Selection opportunity r| 


FIGURE 4. The relative root mean square error (RMSE) in the estimate of the selective 
optimum 9c as rj and 0 arc varied for a range of tree size and branch evenness val¬ 
ues. Red hues correspond to less than 100% relative error; blue, to greater than 100%. 
Predictions are based on the best-fitting GAM regression. Contours show the levels of 
SNR. 


Discussion 

This study investigated the statistical properties of methods for phylogenetic comparative analysis 
based on models of the OUCH class. Previous studies have suggested that OU-based methods work 

signal is strong (i.e., many taxa sampled and large SNR, e.g., 
). This study explored more problematic regions of parameter 
space with the goal of identifying where, how, and why these methods break down as sample size 
or SNR become small so that users can apply them with greater understanding of their strengths and 
weaknesses. 


reliably when sample sizes are large and 


Beaulieu et al.| 2012} Ho and Ane[ 2014 
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FIGURE 5. Relative RMSE in the estimate of selection opportunity (fj) as rj and o are 
varied for a range of tree size and branch evenness values. Red hues correspond to less 
than 100% relative error; blue, to greater than 100%. Contours show increasing SNR. 
Predictions are based on the best-fitting GAM regression. 


Our results show that model-selection power can be high, even when sample sizes are small, or 
when the magnitude of the deterministic pull towards the optimum is small. The critical determinants of 
power in our analysis are what we have termed the signal-to-noise ratio and the discriminability ratio, 
both of which involve the distance between optima relative to the stationary variance of the OU process. 
We found that parameters can in some circumstances be well estimated, but that care should be taken in 
interpreting estimates, as parameters can be poorly identified even when model selection is robust. We 
discuss these issues further below. 


The utility of OU-based methods for comparative analysis is indicated by the number of extensions 
that have been implemented in recent years. This includes methods that allow the optima themselves 
to evolve along the phylogeny according to a Brownian motion process, obviating the need to specify 
regimes on the phylogeny (SLOUCH, Hansen et al. 2008), methods that allow for multivariate traits 
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FIGURE 6. Relative RMSE in the estimate of discriminability ratio (0) as rj and 0 are 
varied for a range of tree size and branch evenness values. Red hues correspond to less 
than 100% relative error; blue, to greater than 100%. Contours show the levels of SNR. 
Predictions are based on the best-fitting GAM regression. 


(Bartoszek et al. 

2012 ;; 

King and Butler 

2009|), methods that allow a and/or a to vary across the 

phylogeny (OUwi 
history of regime 
of our study appl; 

e, Beaulieu et ah, 

2012 ), and methods that attempt to reconstruct the evolutionary 

changes (Hipp and Escudero, 2010; Ingram and Mahler 2013). The conclusions 
/ with equal force to all implementations of the OUCFI class of models and have 


implications for methods based on generalizations of the OUCFI class. 


Distinguishing among competing hypotheses. Our results show that, when selection opportunity, ?y, 
and discriminability ratio, </>, are both sufficiently great and the tree is large enough, the signal of adap¬ 
tation is strong and we reliably recover the true model (Figs. [2] [3] and A51. It is important to note, 
however, that when the separation of the optima is small relative to the fluctuations in the evolutionary 
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process (small cj>), it is difficult to infer the model accurately, regardless of the value of selection oppor¬ 
tunity, rj = aT (Fig. [3]). On the other hand, if cj) is sufficiently large, power can be high even when rj 
is small. In the latter situation, to a good approximation, tree size and the signal-to-noise ratio, rj cf>, 
predict model selection power (Fig. |A5| ). 

In general, it was only when r] and © were both sufficiently large that the OU3 model was reliably 
recovered. Certain other regions of parameter space produced data frequently mistaken as having been 
generated by one of the simpler alternative models (Fig.[2]). Brownian motion found support only when rj 
was quite small; the single-optimum OU model, only when © was small. The two-optimum OU models 
found some support within a band of q-6 combinations for which the phenotypic distributions of species 
in neighboring selective regimes were likely to overlap, either because selection opportunity was low 
but discriminability ratio high, or because selection was high but discriminability low. Over the range of 
parameters we examined, the non-phylogenetic model had very little support. We anticipate, however, 
that at higher levels of r) than were explored here, the non-phylogenetic model would find more support. 
This is because when rj is so large that adaptation is essentially instantaneous, the evolutionary history 
of each lineage becomes irrelevant. In this situation, the non-phylogenetic model and the true OU model 
will each fit the data equally well; the non-phylogenetic model, being simpler, will be preferred by AIC. 
In this region, the preference of AIC for the NP model is not evidence against the adaptive hypothesis. 
Thus, large values of r] do not pose challenges to the detection of adaptive evolution. 


Finally, although investigators with small datasets may find that increasing sample size rapidly 
improves power, hope remains for studies based on small numbers of species (Fig. A5 I. We found that 
power can be high when SNR is large even with as few as 10 species. Similarly, Ho and Ane (2014 1 
showed that a dimensionless effect size, closely related to our discriminability ratio, was a much better 
predictor of power than the number of taxa. Moreover, they found that when discriminability ratio is 
small, power can not be greatly improved by increasing the number of taxa sampled, even to very large 
numbers. This finding echoes our own results (Figs.[3]and A5 1 . 


Reliability of parameter estimates. Our results show that the parameters differed in the accuracy with 
which they could be estimated. Relatively accurate estimates of cj) were common when trees were of 
moderate size and cj) was not very small (Fig. [5]). Selection opportunity, rj, was the most difficult to 
estimate (Fig. [6]): relative RMSE was less than 50% only when both selection opportunity and discrim¬ 
inability ratio were high. 


It is noteworthy, however, that relative error of 50% and more were common in many regions of pa¬ 
rameter space. Thus, our results call for caution in the interpretation of parameter estimates for OUCH- 
class models. Moreover, since as our results show, the quality of parameter estimates varies markedly 
across parameter space, a case-by-case approach to the quantification of uncertainty in parameter es¬ 


timates is essential. For this purpose, approaches based on the parametric bootstrap (Boettiger et al. 
2012 Scales et al. 2009|) are recommended. 


Estimates of the trait optima tended to be more accurate than estimates of r/ and 4>, reflecting the fact 
that the three optima must be distinguishable for the OU3 model to be well supported. We noted that the 


identity of the regime operating at the root influenced the estimates of the selective optima (Fig. A 81 . 
In particular, the best-estimated regime was the regime operative at the root and the worst-estimated 
regime was that with the optimum farthest from that of the root regime. 


Ho and Ane] (2013) demonstrated mathematically, for the single-optimum OU process on an ul¬ 
trametric tree, that the parameters a and a 2 can be consistently estimated, but that the estimate of a 1 
converges faster than does the estimate of a. [An estimator is said to be consistent when estimates 
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FIGURE 7. Typical contour plot of log likelihood against 77 and <j>. Greyscale and con¬ 
tours show log likelihood relative to its maximum value. For each 77 and 0 combination, 
the log likelihood was calculated for the OU3 model without fitting the model to the 
data. The data generated were for a tree of size 39, with true 77 and <fr shown by the 
asterisk. 


converge to the true value as the number of data increases.] This parallels our result that, though the 
estimates of both 77 and cf> improve with tree size, 0 is always better estimated (Figs. [5] an d©- 


In later work, {Ho and Ane[ ( |20l4| proved that selective optima can be consistently estimated as long 
as, for some selective regime, the branches in that regime do not form a connected subtree. Our finding 
that the accuracy of trait-optima estimates increases with the size of the tree (Fig. [4]) is consistent with 
the results of Ho and Ane (2014), since ah trees considered in our study met that criterion. We expect 
that accurate estimates of the trait optima would be impossible otherwise. 


Parameter estimation and the topography of the likelihood surface. It is perhaps striking that, in 
certain regions of parameter space, model selection can be quite reliable, yet parameter estimates fraught 
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with error (cf. Figs. [3]-[6]). This is due to inherent limits in the identifiability of the parameters, i.e., to 
limits in the information content of the data. Fig. [7] shows the shape of the likelihood surface in one 
typical case. The relatively high curvature in the 0 direction corresponds to the fact that estimates of 
that parameter can be made with some precision. The likelihood ridge extending in the 77 direction 
implies that many values of 77 are roughly equally consistent with the data: selection opportunity is 
much less well identified. The existence of likelihood ridges has been noted since Hansen’s original 
(1997) paper, but is still not widely appreciated ( Ho and Ane[ 2014). However, since model selection 
depends on the maximum height attained by the surface, and not on its shape, discrimination between 
models can be sharp even when parameter estimates are not. Such ridges in likelihood space lead to 
parameter estimates with large confidence intervals and caution against over-interpretation of parameter 
estimates. It is in general worthwhile to explore the likelihood surface for a given dataset by evaluating 
the likelihood at parameter values near the MLE. 

The foregoing observations have implications for more parameter-rich methods based on general¬ 


izations of the OUCH class in which parameters are allowed to vary across the tree (Beaulieu et al. ( 
2012} O’Meara et al.| 2006 1 . Our study shows that the information contained in the data are already 
markedly limited when these parameters are constant. Increasing the number of parameters can only 
further dilute information, leading to even greater uncertainty in estimates. 


Assessing reliability in practice. While our simulation study sheds some light on how phylogenetic 
comparative analyses based on OUCH models can be expected to perform in principle, some care must 
be taken in translating these results into practical recommendations. For example, having fit an OUCH 
model to one’s data, one might be tempted to map one’s estimates 77 , <t> onto Fig. [ 3 ] in an attempt to 
gauge the reliability of one’s model selection result. Similarly, one might be tempted to assume that 
unambiguously strong support for an adaptive model implies reliability of parameter estimates. Our 
results suggest that these temptations should be resisted, as inaccuracy in parameter estimates can lead 
to misleading conclusions. Moreover, such inaccuracy may not be readily detectable. We illustrate the 
variety of potential outcomes by focusing on three scenarios (Fig. [ 8 ]). 


In the optimistic Scenario A, 77 and 0 are both large and the true values of these parameters are well 


within the 95% bootstrap confidence interval. The phylogenetic Monte Carlo distributions (Boettiger 


et al.[ |2012[ Scales et al.| 20091 indicate that the AIC c -based rejection of the simpler alternatives i 


is 


justifiable at a high confidence level. This scenario accords with our expectation (based on Figs. [ 
and A5 1 : when the true values of 77 and </> are both large, model misidentification is rare and parameter 
estimates tend to be accurate. 


Scenario B is likewise optimistic. Here, though according to our AIC c criterion the OU3 model is 
best supported, fj and (p are both low. In accordance with the simulation study results, bootstrap confi¬ 
dence intervals are large. The simulation study also suggests, however, that in this region of parameter 
space, selection of the OU3 model should be rare and unreliable. This is borne out by the phylogenetic 
Monte Carlo analysis, which shows that the observed AIC c difference frequently occurs even when the 
data are generated by the simpler models. Thus, in this scenario, both the parameter estimates and the 
phylogenetic Monte Carlo results warn us against placing too much confidence in our selection of the 
OU3 model. 


Scenario C yields a cautionary tale. As in scenario B, the data were generated in a regime where, 
according to the simulation study, reliable recovery of the data-generating model should be rare. For 
the particular data observed in this scenario, however, OU3 was indeed best supported, with parameters 
estimated as shown. These parameter estimates are far from the truth, incorrectly suggesting that we 
are within the high-power, high-accuracy regime. Nor does phylogenetic Monte Carlo (Fig. [8]2) give 
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us any hint of our mistake. Scenario C was rare in our simulation study (for example, fewer than 10% 
of datasets with true rj and <i> less than 2 resulted in estimates r) and <j) larger than 2 and bias greater 
than 10%). Its existence, however, urges a degree of caution in the interpretation of parameter estimates 
in this model and generalizations of it that are even more parameter-rich. This scenario illustrates that 
while parameter estimation and model selection are intimately related, it is possible to have wildly 
inaccurate parameter estimates while model selection works well. 


The utility and limitations of the dimensionless model. Our analysis shows that the model selection 
power and parameter estimation accuracy are largely determined by the dimensionless parameters r] 
and (j> and by the number of taxa sampled. Because these quantities arc dimensionless, they can be 
meaningfully interpreted and compared across studies, whereas the interpretation and comparison of 
parameters like a and a are complicated by their dependence on the measurement scales of time and 
trait (see also Ho and Anc. [2014| ). 


Although the formulation of the models in terms of dimensionless parameters is useful, it does not 
translate directly into a prescription for data analysis. In particular, we note that, although rj = aT 
is unambiguous if the tree depth, T, is known, the discriminability ratio, 0, depends on the unknown 
quantity A 9. Moreover, as we saw above, the potential for inaccurate parameter estimates means that, 
even having estimated the optima (and hence A9), one cannot safely conclude that one is well within 
a region where estimates are reliable and power is high. In effect, the simulation study described here 
maps out an important portion of the model parameter space. In contemplating the use of OUCH-class 
methods for data analysis, we do well to understand this map, even if we must live with the fact that we 
cannot always know with certainty where we are on it. 


Beyond OUCH. When OUCH models are used for the purpose of hypothesis testing, a difficulty of¬ 
ten encountered is that the biological hypothesis one wishes to test cannot be uniquely translated into 
a single regime painting. For example, it is often the case that the history of selective regimes deep 
in the phylogeny is not known and various regime paintings are therefore plausible. There is no diffi¬ 
culty when a hypothesis does entail a unique regime painting: it is a straightforward matter to obtain 
maximum-likelihood (ML) parameters and a maximized likelihood that can be used to compare the 
hypothesis against alternatives. Should this likelihood be sufficiently low, one may reject the original 
hypothesis, precisely because the regime painting is its necessary consequence. However, if the hy¬ 
pothesis is compatible with more than one regime painting, rejection of any particular painting does not 
logically force rejection of the hypothesis. In other words, in such a situation, uncertainty concerning 
the regime painting remains. Many recent methodological developments have focused on accounting 


for this uncertainty in the regime painting (Eastman et al. 2011; Hipp and Escuderol 2010; Ingram and 

Mahler} 

2013, Revell et al. 2011} Uyeda and Harmon} 

2014 

). It is worth considering here how such 


uncertainty can properly be taken into account. The two major branches of statistical philosophy offer 
two closely related approaches. 


A Bayesian approach would be to treat the regime painting, P, as a additional parameter, assign 
priors, and perform inference as usual ( Bollback] 2006). The key ingredient in doing so would be the 
proportionality relation 

7 r(0 ,P|A)oc/(A|0,P)7r(0,P), (4) 

where X is the data, 0 stands for the parameters of the OU (or BM) phenotype model, 7r(0, P\X) is 
the posterior distribution, 7r(0, P) is the joint prior on P and 0, and /(Aj0, P) is the OU (or BM) 
likelihood. One drawback of this approach is the need to specify priors on 0 and P even when little 
or no prior information on these parameters exists. The methods proposed by Revell et al. (2011 1 , 
Eastman et al. (2011 1 , and|Uyeda and Harmon (2014) adopt this approach. If one is willing to propose a 
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FIGURE 8 . Three scenarios illustrating various parameter estimation and model selection out¬ 
comes, as described in the text. Scenario A is drawn from a region where power is high and 
parameter estimates reliable. Scenarios B and C are drawn from regions where power is low 
and parameter estimation unreliable. In Scenario B, both the true parameters and the estimates 
lie in regions of low power. In Scenario C, however, the overestimation of rj and 0 tempts one 
to the specious conclusion that the true parameters are in a region of high power and accuracy. 
The plots at left indicate true parameter values (asterisks) and ML estimates (filled circles). The 
crosshairs shows 95% bootstrap confidence intervals. The plots at right show the phylogenetic 
Monte Carlo results for pairwise comparison of alternative models based on 1000 bootstrap 
replicates. The density plots show the distribution of AIC c differences between the indicated 
models (OU3-simpler model). In each case, the light gray density shows the distribution of 
AIC c differences when the data-generating process is OU3 at its MLE; the dark gray differ¬ 
ences result when the data-generating process is the simpler model at its MLE. The vertical 
dashed line shows the observed AIC c difference on the original data. Comparing the observed 
AIC c difference to the dark gray density gives an indication of the statistical significance of 
the observed AIC c difference. Lack of overlap between the light gray and dark gray densities 
indicates high power of the hypothesis test. 
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probabilistic model for P, however, one arrives at a hierarchical model, for which the key proportionality 
relation is 

vr(0, P\X) oc f(X |©, P) g(P\$) t r(0, $), (5) 

where g(P |3>) is the likelihood of regime painting P under the parameters ( I>. One still has the difficulty 
of assigning priors over 0 and <I>, but this may be somewhat more straightforward than doing so over 
the regime paintings P. 

The ML approach is to integrate over the uncertainty in P. Specifically, one seeks parameters 0 
and <f> that maximize the likelihood 


£(X|0,cf») = ^/(X|0,P)p(P|$), 


( 6 ) 


the sum being taken over all regime paintings P. Because the space of paintings will often be large, a 
Monte Carlo approach to the approximation of the sum may often be needed in practice. As with the 
Bayesian approach, accurate estimation requires adequate exploration of the space of P. Though the 
ML approach requires that one must maximize the likelihood (Eq. [6]), which can sometimes be difficult 
relative to the problem of sampling from the posterior, it has the advantage that no priors are needed. 

In other words, the maximum likelihood framework we have applied in this paper can be extended 
to deal with regime-painting uncertainty. One must translate one’s hypothesis into a probability distribu¬ 
tion over regime paintings. Then, to compute the likelihood at any given set of parameters, one proceeds 
as follows. First, generate a large number of regime paintings from the regime-painting model. For each 
painting, P, compute the conditional likelihood of the data under the phenotype model (e.g., of OUCH 
type), given P. The average of these conditional likelihoods is the required likelihood. This recipe gives 
the likelihood at a given point in parameter space: to complete the inference, it remains to maximize it 
over all values of the parameters. 


While certain theoretical advantages attach to the Bayesian or ML prescriptions (both of them based 
on likelihood), other approaches are of course possible. One might, for example, propose a method that 
seeks the “true” regime painting by searching over the space of all such paintings in quest of the one that 


yields the highest maximized likelihood, lowest AIC, or is by some other criterion deemed best (Ingram 


and Mahler, 2013|). One worries that, since the size of the parameter space is large, such a method 


would suffer from overfitting and statistical inconsistency of estimates. Another alternative might be to 
propose a large number of paintings, maximize the parameters 0 over each one, and somehow average 
the resulting parameters and maximized likelihoods (Collar et al. 2009 2010[ Hipp and Escudero 


2010). Note that this is not at all the same as the integration over the uncertainty in P performed under 


either of the likelihood-based prescriptions: the maximization and averaging steps are transposed. 


Clearly, there is growing interest in the development and use of sophisticated phylogenetic com¬ 
parative methods based on models more complex and realistic than the OUCH-class models studied 
here. While our results inform some expectations, a thorough study of the statistical properties of these 
methods would be invaluable. Overall, we suggest that it is imperative for biologists to understand not 
only the potential of these methods, but also their inherent limitations. 
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Appendix A. Supplementary figures 


Fig. A1 is a schematic diagram of the simulation study design. Fig. A2 shows the distribution of 
selection opportunity 77 , discriminability ratio 0, and tree size explored in this simulation study. As 
explained in the main text, we used a Sobol’ low-discrepancy sequence ( Press et ah] |l992) to generate 
220 ( 77 , 0, tree size) triples. This samples in such a way that each variable is sampled uniformly across 
its range and such that combinations of variables are also sampled uniformly. This range of selection 
opportunity 77 values corresponds to values of a ranging between 0.2 and 5 (because 77 = a T). Cor¬ 
respondingly, this implies that the depth of the phylogeny, ranges between 0.29 and 7.2 phylogenetic 
half-lives (Hansen| 1991) . This range of discriminability ratio 0 values corresponds to a range of noise 
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Figure A1. Plan of the simulation study. 
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FIGURE A2. Scatterplots showing the distribution of selection opportunity 77 , discrim- 
inability ratio </>, and tree size values explored in this study. 


intensity a values between 0.13 and 16 (because <f> = y/ ^* A6> ). Correspondingly, this implies that the 
separation between optima varies between 0.2 and 5 standard deviations of the stationary distribution 
around each optimum. 


To get a sense of how how much variation there is in tree shape among the subtrees sampled 
from the large Anolis phylogeny, we show scatterplots of tree size, our regime evenness metrics Jb 
and Jt, tree imbalance (the standardized Colless index, iMooers and Fleard 1997), and the number 


of tips in each regime across all 65600 subtrees used in this study (40 regime paintings x 40 subtree 
specifications x 40 tree sizes, Fig. A3 1 . These scatterplots show that there are no discernable biases in 
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FIGURE A3. The distribution of tree shape statistics across all subtrees included in this 
study. Tree size is number of tips. The Jb, Jt, and imbalance metrics are described in 
the text, n, refers to the number of tips in regime i, i F {A, B, C}. 


how we generated trees, in terms of the effect of e.g., size on tree imbalance or regime evenness. There 
is an interesting, and not entirely unexpected, positive correlation between branch and tip evenness 
(correlation coefficient of 0.54). This reflects the fact that trees with highly balanced regimes across 
the internal branches are more likely to have highly balanced regimes across the tips. This correlation 
is at least a partial explanation for why branch evenness is a more important predictor variable than tip 
evenness in our GAM regressions. We attempted to account for this correlation in our regressions, for 
example by using residual tip evenness (observed tip evenness minus the expectation based on the linear 
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FIGURE A4. Power as a function of selection opportunity rj and discriminability ratio 
<j) for a range of tree imbalance and tip evenness values. Contours show increasing SNR. 
Predictions are based on the best-fitting GAM regression with tree size held constant at 
30 and branch evenness held constant at 1.0. 


regression of tip evenness on branch evenness), but this did not provide a better fit to the data or even 
increase the statistical significance of tip evenness as an explanatory variable. 


Fig. A4 shows heatmaps of power across variation in tree imbalance and tip evenness values. Tree 
size and branch evenness were both held constant for these figures. The predictions arc based on GAM 
regressions of observed power against r], ©, tree size, tree imbalance, and regime evenness (Jn and 
Jt). The contours show increasing signal-to-noise ratio (SNR, yff\ <j>). One can see that neither tree 
imbalance nor tip evenness has any discernable effect on predicted power. The significance of these two 
variables in the GAM regression is largely due to the large sample sizes rather than to effect size. 


Figs. A 6 and A7 show heatmaps of predicted root mean square error (RMSE) in the estimates of 
9a and 9 b as // and <f> are varied, for a range of values of tree size and branch evenness. Tip evenness 
and tree imbalance were held constant. These predictions are based on GAM regressions of observed 
RMSE against 77 , <i>, tree size, tree imbalance, and regime evenness (Jj> and Jt). The contours show 
increasing signal-to-noise ratio (SNR, y/rj 0 ). Comparing these plots to Fig. |4]in the main text, one sees 
that although the response to changes in predictor variables is the same, 9c is much better predicted 
than 9a and 9b■ We discuss this result in more detail in the main paper. It has to do with the fact that 
the regime present at the root is better estimated than any of the other optima; by chance, 18 of the 40 
regime paintings we used happened to have the root in regime C, compared to 12 paintings with the 
root in regime B and 10 with the root in regime A. Controlling for the identity of the root regime, the 
response of RMSE to changes in predictor variables is nearly identical across selective optima (Fig. A 81 . 


SNR does a reasonably good job as a predictor of the error in the estimates of the selective optima 
(Figs.|4} |A6l and |A7| ). Flowever, it is not a good predictor of either rj or 0 , as suggested by the lack of 
correspondence between predicted error and SNR in Figs. [5] and [ 6 ] These conclusions are supported by 
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FIGURE A5. Power as SNR and tree size are varied for a range of branch and tip 
evenness values. Power is predicted using the best-fitting GAM regression equation. 
Red corresponds to better than 50% probability of selecting the correct model; blue, to 
less than 50%. 


comparing the R 2 values of the GAM regressions using SNR instead of // and 0 separately (77 regression 
R 2 drops from 0.66 to 0.20, 0 regression R 2 drops from 0.69 to 0.48, and 9c regression R 2 drops from 
0.45 to 0.43). Fig. A12 shows RMSE in the estimate of 9c as SNR and tree size varies, for a range of 
tip and regime evenness values. 
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FIGURE A6. The root mean square error in the estimate of the selective optimum 6a 
as rj and & are varied for a range of tree size and branch evenness values. Contours 
show increasing SNR. Predictions are based on the best-fitting GAM regression. 
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FIGURE A 7 . The root mean square error in the estimate of the selective optimum 6 j> 
as rj and & are varied for a range of tree size and branch evenness values. Contours 
show increasing SNR. Predictions are based on the best-fitting GAM regression. 
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FIGURE A8. Root mean square error in the estimate of each selective optimum across 
a range of q and <f> values, after accounting for the identity of the root regime. These 
heatmaps are the predictions of RMSE based on GAM regressions to subsets of the full 
simulation study that consider only those regime paintings where the root was in each 
given regime. For these predictions, tree size was assumed equal to 30, branch and tip 
evenness were assumed equal to one, and tree imbalance was assumed equal to -0.5. 
Each regime is best estimated when the root is in that regime. 
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Figure A9. The relative root mean square error in the estimate of discriminability 
ratio (j) as rj and <fi are varied for a finer range of tree size values. 
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FIGURE A10. The GAM-predicted absolute root mean square error in the estimate 
of selection opportunity r] as // and <f> are varied for a range of tree sizes and branch 
evenness. As in the main text, tree imbalance and tip evenness are assumed fixed at 
-0.5 and 1.0, respectively. 
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FIGURE All. The GAM-predicted absolute root mean square error in the estimate 
of discriminability ratio <i> as r/ and <j> are varied for a range of tree sizes and branch 
evenness. As in the main text, tree imbalance and tip evenness are assumed fixed at 
-0.5 and 1.0, respectively. 
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FIGURE A12. The root mean square error in the estimate of the selective optimum 
9c as SNR and tree size are varied, for a range of tip and regime evenness values. 
Predictions are based on the best-fitting GAM regression. 
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Appendix B. Generalized additive model regressions 


We fit generalized additive models (GAMs) to capture the dependence of observed power and mean 
square error on the simulation study parameters ( 77 , q i>, tree size, tree imbalance, and regime evenness). 
In the GAM framework, the location of the response variable is assumed to be a smooth function of 
the predictor variables. GAMs are appropriate here because we had no expectation as to the shape of 
the dependence of the response variable on any of the predictors. The GAMs accomodate this need 
for flexibility using thin plate regression splines to describe the relationship between the response and 
the predictors (Wood 2003[ 2006). Thin plate regression splines are simply splines with two or more 
basis dimensions, where the “wiggliness” of the spline is penalized to avoid overfitting. A nice physical 
interpretation of thin plate regression splines is to visualize the relationship between the response and 
two predictor variables as a rugged landscape where the height of the landscape gives the value of 
the response variable at combinations of predictor variables. The thin plate regression spline can be 
visualized as a thin sheet of metal that is being bent to fit the shape of this landscape; the rigidity of 
the metal is analogous to the penalty applied to the spline to avoid overfitting. Fits were accomplished 
using the gam function provided in the R package mgcv fWood[ 2006 1 . This function allows different 
predictors to be added, removed, or combined and the resulting fits to be compared via analysis of 
deviance, as with generalized linear models. 


In the tables below, we display analysis of deviance tables comparing several GAMs for each 
response variable (power, relative RMSE in the estimates of 77 and <p, RMSE in the estimates of the 
selective optima). There are a number of conclusions we can draw. First, the best-fitting model was 
the same for all response variables. This model predicts the response as an additive combination of 
smooths: a two-dimensional smooth of branch and tip evenness; a one-dimensional smooth of tree 
imbalance; and a three-dimensional smooth of the simulation parameters ( 77 , cj), and tree size). Second, 
the largest improvement in model fit comes from adding branch evenness as an additive fixed effect. 
There is a significant improvement from adding fixed effects of tree imbalance and tip evenness, but 
these only slightly improve model R 2 . Allowing the response to be a smooth function of the tree shape 
and regime evenness metrics improves the fit, especially for the estimates of the selective optima. 

We also compare the R 2 for the best-fitting models using selection opportunity 77 and discriminabil- 
ity ratio ([> as separate predictors to the best-fitting models using the composite parameter of signal-to- 
noise ratio (SNR: y/fj 6) as a predictor (Table |B7| . It is clear from this comparison that models using 
SNR as a predictor perform nearly as well as models with 77 and 0 separately for power, but that the 
ability to predict estimation accuracy is substantially worse. 
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Table B1. Comparison of GAM regression fits of observed power p. Power was logit- 
transformed prior to analysis. This transformation allows the response to vary between 
(— 00 , 00 ) instead of being bounded between 0 and 1, making it easier to tit the GAM 
to the data. 




Resid. Df 

Resid. Dev 

Df 

Deviance 

R z 

log ( 

y-2- Vs(be,te)+s(imb)+s(77, (j>, size) 

351854 

206658.00 



0.93866 

log ( 

)~s(be)+s(te)+s(imb)+s(?7, d>, size) 

/ P \ 

351865 

206831.00 

-12 

-173.78 

0.93861 

log ( 

J~be+te+imb+s(? 7 , <j>, size) 

351887 

207431.00 

-22 

-599.84 

0.93844 

log ( 

J'be+imb+sfry, </>,size) 

351888 

207494.00 

-1 

-63.08 

0.93832 

log ( 

^ Vbe+s(? 7 , </>,size) 

351889 

207836.00 

-1 

-341.79 

0.93832 

log ( 

v I^)' s (?7,0,size) 

351890 

256837.00 

-1 

-49000.90 

0.92378 


Table B2. Comparison of GAM regression tits of observed relative root mean square 
error in the estimate of selection opportunity //. Relative RMSE was log-transformed to 
deal with the occasional large value—the maximum observed RMSE was greater than 
10 3 . 




Resid. Df 

Resid. Dev 

Df 

Deviance 

K z 

logio 

f RMS ^ E M \ ~s(be,te)+s(imb)+s(? 7 , <p,size) 

271086 

13608 



0.65608 

logio 

f RMSE G) j ~s(be)+s(te)+s(imb)+s(? 7 , <f>, size) 

271097 

13619 

-10.43 

-11.89 

0.65579 

logio 

f RMSE ^ \ ~be+te+imb+s(r/, </>, size) 

271118 

13650 

-22 

-21.84 

0.65505 

logio 

f RMSE ^ j ~be+imb+s(77, </>,size) 

271119 

13650.10 

-1 

-0.15 

0.65504 

logio 

} R MSE(ri) \ -^e+s (jj, j Z e) 

271120 

13669.90 

-1 

-19.73 

0.65455 

logio 

( (r,) ) s(? 7 , </>,size) 

271121 

13792.50 

-1 

-122.63 

0.65145 


Table B3. Comparison of GAM regression tits of observed relative root mean square 
error in the estimate of discriminability ratio <i>. Relative RMSE was log-transformed to 
deal with the occasional large value—the maximum observed RMSE was greater than 
80. 





Resid. Df 

Resid. Dev 

Df 

Deviance 

K z 

iogio 

iogio 

f RMSE(4>) \ 

l <\> ) 

f RMSE(<P)\ 

l <fr ) 

~s(be,te)+s(imb)+s(? 7 , ^>,size) 

~s(be)+s(te)+s(imb)+s(? 7 , (f>, size) 

271087 

271097 

5417.5 

5419.3 

-10.76 

-1.87 

0.68922 

0.68913 

iogio 

( RMSE((j>)\ 
{ <t> ) 

'be+te+imb+sfry, </>,size) 

271118 

5438.3 

-20.56 

-18.95 

0.68806 

iogio 

( RMSE((j>)\ 

'be+imb+s(?y, ^>,size) 

271119 

5438.4 

-1 

-0.067 

0.68806 

iogio 

( RMSE(<P)\ 
{ (f> ) 

~be+s(? 7 , <fi,size) 

271120 

5438.5 

-1 

-0.17 

0.68805 

iogio 

( RMSE(<fi)\ 
{ <t> ) 

~s(ry, </>,size) 

271121 

5662.3 

-1 

-223.76 

0.67522 
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Table B4. Comparison of GAM regression fits of observed relative root mean square 
error in the estimate of the selective optimum 9 a- Relative RMSE was log-transformed 
to deal with the occasional large value—the maximum observed RMSE was greater 
than 35. 



Resid. Df 

Resid. Dev 

Df 

Deviance 

R z 

log 10 ( RMSE(dA))~s(be , te) + s(imb) + s( 77 , </>,size) 

271085 

20900 



0.47319 

log 10 ( RMSE(dA))~s(be ) + s(te) + s(imb) + s(r/, (j), size) 

271096 

20985 

-10.84 

-84.55 

0.47108 

log 10 {RMSE{0a))~'oz + te + imb + s(ry, ^,size) 

271119 

21820 

-23.58 

-835.88 

0.45006 

log 10 (RMSE(6 n))~be + imb + s( 77 , ^,size) 

271120 

21822 

-1 

-1.69 

0.45001 

log 10 (RMSE(6a))~ be + s( 77 , </>, size) 

271121 

21822 

-1 

-0.16 

0.45001 

log 10 {rMSE(Oa))~s(Vi & size) 

271122 

22174 

-1 

-351.74 

0.44115 

Table B5. Comparison of GAM regression fits of observed relative root mean square 
error in the estimate of the selective optimum Ob- Relative RMSE was log-transformed 


to deal with the occasional large value — the 

maximum 

observed RMSE was greater 


than 40. 







Resid. Df 

Resid. Dev 

Df 

Deviance 

R 2 

log 10 (RMSE(9B))~s(be- te) + s(imb) + s(ij, (f>, size) 

271085 

13104 



0.53186 

log 10 (i?AfS'i7(0B))"s(be) + s(te) + s(imb) + s( 77 , cj), size) 

271095 

13212 

-10.35 

-107.48 

0.52804 

log 10 (RMSE[9 s))~be + te + imb + s(rj, <p,size ) 

271118 

13394 

-23.52 

-182.43 

0.52157 

log 10 ( RMSE(9B))~be + imb + s(r], (j), size) 

271119 

13460 

-1 

-65.92 

0.51921 

log 10 (RMSE(6b))~^>^ + s( 77 , (j), size) 

271120 

13460 

-1 

-0.76 

0.51919 

log 10 ( RMSE(6b))~s(t 7 , (j), size) 

271121 

13607 

-1 

-146.68 

0.51395 

Table B6. Comparison of GAM regression fits of observed relative root mean square 
error in the estimate of the selective optimum 9c- Relative RMSE was log-transformed 


to deal with the occasional large value—the 

maximum 

observed RMSE was greater 


than 34. 







Resid. Df 

Resid. Dev 

Df 

Deviance 

R 2 

log 10 (RMSE(9b))~ s(be, te) + s(imb) + s(?y, 0 ,size) 

271093 

20598 



0.45232 

log 10 (RMSE(0 B ))~s(}oe) + s(te) + s(imb) + s( 77 , (j), size) 

271096 

20685 

-3.17 

-86.81 

0.45002 

log 10 (RMSE(9 s))~be + te + imb + s(rj, <p,size ) 

271119 

21398 

-22.74 

-712.93 

0.43111 

log 10 (RMSE(9B))~be + imb + s(? 7 , ^,size) 

271120 

21406 

-1 

-7.69 

0.43091 

log 10 (RMSE(dB))~be + s(? 7 , ^,size) 

271121 

21415 

-1 

-8.97 

0.43067 

log 10 (RMSE(d B ))~s(r], </>,size) 

271122 

21442 

-1 

-26.96 

0.42995 


Table B7. Comparison of R 2 values for GAM models using r/ and cj) as predictors to 
GAM models using the composite parameter SNR as predictor. 


Response 

model R 2 

SNR model R 2 

Power 

0.93866 

0.93669 

log 10 ( RMS y\ 

0.65580 

0.20236 

iogio(™r w ) 

0.68919 

0.48645 

log 10 (RMSE(9a)) 

0.47277 

0.43562 

log 10 (. RMSE(9 b )) 

0.53160 

0.47181 

log 10 (■ RMSE(9 C )) 

0.45232 

0.43272 



